Forest cover types and land cover play a key role in environmental assessment. Accurate information about natural resources matters to many different entities: private landowners, local governments, federal agencies, and conservation organizations. Normally, land cover data are generated from remote sensing data. However, those data sets can be difficult and costly to process. Therefore, we can try to use cartographic data to predict forest cover types. There are various supervised classification algorithms we can apply to this dataset, including K-Nearest Neighbors, Support Vector Machines, tree-based methods, and neural networks. In this project, we will try as many methods as we can and compare the results. The original effort to classify the forest cover type on this dataset achieved 70.52% classification accuracy using an artificial neural network.
The main problem we are trying to solve is predicting forest cover type from cartographic information. Which model performs best on this classification task? Can we apply the same model to different regions? The results can feed into other analyses such as fire hazard prevention, natural asset management, and climate change studies.
Another problem I am trying to address here is the trade-off of using remote sensing data. In many cases, remote sensing data is useful and beneficial: it can cover large areas and places humans cannot reach in person, and its temporal element allows us to see the dynamic change of the environment. However, it also comes with disadvantages. The files are very large, so processing them requires substantial computing power, and remote sensing can be interfered with by other phenomena such as weather. The main goal here is to see whether we can predict tree type using cartographic information alone. If I have time, I will also try to combine remote sensing data with basic cartographic information. Would the prediction perform better with data from different origins and dimensions?
This project is a supervised classification task. We will randomly divide our dataset into training and testing sets. Cross-validation will be performed on the training set to find the best hyperparameters for each model. The main objective of this project is to find a method with high test accuracy on classifying forest cover types. The second objective is to achieve a high score in the Kaggle competition. The third objective is to use remote sensing to increase the overall accuracy of the best model.
This dataset was retrieved from the UCI Machine Learning Repository. It was originally compiled by Jock A. Blackard of the USFS and Dr. Denis J. Dean of UT Dallas. The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and USFS. There are seven forest cover type classes: spruce/fir, lodgepole pine, ponderosa pine (Pinus ponderosa), cottonwood/willow, aspen, Douglas-fir, and krummholz.
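For reference when labeling plots later, the integer codes in the Cover_Type column map to species names as follows; a small lookup dict, with the mapping taken from the UCI dataset description (verify it against your copy of the data):

```python
# Cover_Type integer codes -> species names, per the UCI dataset description
cover_type_names = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}
print(cover_type_names[5])  # -> Aspen
```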
In this section, I will perform some analysis and basic visualization of the tree cover type dataset.
#install packages
!pip install prince
#import basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import io
import warnings
warnings.filterwarnings('ignore')
#convert csv to pandas dataframe
treetype = pd.read_csv('train.csv')
There are 15120 records in the training set, and the test set is available on Kaggle. That is enough data for us to train and validate.
#@title
treetype.shape
(15120, 56)
Here is a look at the first 5 rows of data.
#@title
treetype.head()
| Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
5 rows × 56 columns
Some attribute names are quite long, so I renamed a few columns.
#@title
#Rename some columns to make it simple
treetype=treetype.rename(columns={"Horizontal_Distance_To_Hydrology": "HDis_Hydro", "Vertical_Distance_To_Hydrology": "VDis_Hydro"})
treetype=treetype.rename(columns={"Horizontal_Distance_To_Roadways": "HDis_Rd", "Horizontal_Distance_To_Fire_Points": "HDis_Fire"})
Here is a description of all columns in this dataset.
Elevation - Elevation in meters \
Aspect - Aspect in degrees azimuth \
Slope - Slope in degrees \
Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water features \
Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water features \
Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway \
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice \
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice \
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice \
Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points \
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation \
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil type designation \
Cover_Type (7 types, integers 1 to 7) - Forest cover type designation
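As a side note, the horizontal and vertical hydrology distances can be combined into a single straight-line distance to water, a common derived feature for this dataset. A sketch on invented values (the notebook itself does not compute this):

```python
import numpy as np
import pandas as pd

# Toy rows with the two hydrology distance columns (values invented)
demo = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [258, 212],
    "Vertical_Distance_To_Hydrology": [0, -6],
})
# Euclidean distance to the nearest surface water feature
demo["Dist_To_Hydrology"] = np.hypot(
    demo["Horizontal_Distance_To_Hydrology"],
    demo["Vertical_Distance_To_Hydrology"],
)
```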
First, we get basic information and statistics for each variable.
treetype.describe()
| Id | Elevation | Aspect | Slope | HDis_Hydro | VDis_Hydro | HDis_Rd | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 15120.00000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | ... | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 | 15120.000000 |
| mean | 7560.50000 | 2749.322553 | 156.676653 | 16.501587 | 227.195701 | 51.076521 | 1714.023214 | 212.704299 | 218.965608 | 135.091997 | ... | 0.045635 | 0.040741 | 0.001455 | 0.006746 | 0.000661 | 0.002249 | 0.048148 | 0.043452 | 0.030357 | 4.000000 |
| std | 4364.91237 | 417.678187 | 110.085801 | 8.453927 | 210.075296 | 61.239406 | 1325.066358 | 30.561287 | 22.801966 | 45.895189 | ... | 0.208699 | 0.197696 | 0.038118 | 0.081859 | 0.025710 | 0.047368 | 0.214086 | 0.203880 | 0.171574 | 2.000066 |
| min | 1.00000 | 1863.000000 | 0.000000 | 0.000000 | 0.000000 | -146.000000 | 0.000000 | 0.000000 | 99.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 3780.75000 | 2376.000000 | 65.000000 | 10.000000 | 67.000000 | 5.000000 | 764.000000 | 196.000000 | 207.000000 | 106.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 7560.50000 | 2752.000000 | 126.000000 | 15.000000 | 180.000000 | 32.000000 | 1316.000000 | 220.000000 | 223.000000 | 138.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 |
| 75% | 11340.25000 | 3104.000000 | 261.000000 | 22.000000 | 330.000000 | 79.000000 | 2270.000000 | 235.000000 | 235.000000 | 167.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| max | 15120.00000 | 3849.000000 | 360.000000 | 52.000000 | 1343.000000 | 554.000000 | 6890.000000 | 254.000000 | 254.000000 | 248.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 |
8 rows × 56 columns
From the pair plot below, we can see that elevation separates the forest cover types best, especially when combined with Aspect and Slope. There are a few outliers in the graph; we will look into whether they are measurement errors.
#@title
sns.set_theme(style="ticks")
plt.style.use('seaborn')
treetypee_sub=treetype.iloc[:,np.r_[1:11,-1]].copy()
sns.pairplot(treetypee_sub, hue="Cover_Type",palette="Paired")
<seaborn.axisgrid.PairGrid at 0x23f00751cc8>
There is no missing data in this dataset.
treetype.loc[:, treetype.isnull().any()].columns
Index([], dtype='object')
Principal Component Analysis \ PCA is not designed for binary data, and the results here are not ideal in my opinion: there is no clear pattern in the principal components across the different tree species.
from sklearn.decomposition import PCA
from sklearn import preprocessing
# exclude predicted variable
treetype_PCA=treetype.iloc[:,np.r_[1:55]].copy()
# Normalize the data
treetype_normal = preprocessing.normalize(treetype_PCA)
# create new target variable
treetype_target=treetype.iloc[:,-1].copy()
treetype_target=treetype_target.astype('category')
pca = PCA(3)
projected = pca.fit_transform(treetype_normal)
plt.scatter(projected[:, 0], projected[:, 1],
c=treetype_target, edgecolor='none', alpha=0.7,
cmap=plt.cm.get_cmap('Paired_r', 7))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();
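One way to quantify how little structure the leading components capture is the explained variance ratio of the fitted PCA; in the notebook one would inspect `pca.explained_variance_ratio_` directly. A minimal, self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))   # stand-in for the normalized features
pca_demo = PCA(n_components=3).fit(X_demo)
# Fraction of the total variance captured by each component, in decreasing order
print(pca_demo.explained_variance_ratio_)
```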
PCA performed poorly on this dataset. Let's take a look at the first three principal components and plot them in a 3-D graph.
import prince
import pprint
from mpl_toolkits.mplot3d import Axes3D
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.express as px
fig = px.scatter_3d(projected,x=0,y=1,z=2,color=treetype_target,opacity=0.7,color_discrete_sequence=px.colors.qualitative.D3)
fig.update_traces(marker=dict(size=2))
fig.show()
FAMD
PCA did not quite do the job. Factor Analysis of Mixed Data (FAMD) is an alternative to PCA that handles numerical and categorical variables together. FAMD clearly produces better results than PCA here, giving a cleaner pattern for dissecting the different forest covers. I created a similar 3-D plot from the FAMD results, and it looks good. However, the explained variability is still low, and the different classes overlap.
# Create new categorical wilderness and soil variables from the dummy columns
wilderness=treetype_PCA.iloc[:,np.r_[10:14]].idxmax(axis=1,)
soil=treetype_PCA.iloc[:,np.r_[14:54]].idxmax(axis=1)
# Concat them back together
treetype_FAMD=pd.concat([treetype_PCA.iloc[:,np.r_[0:10]],wilderness,soil],axis=1)
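The `idxmax` trick above works because each row of a one-hot block contains exactly one 1, so the column name of the row maximum is the category. A toy illustration:

```python
import pandas as pd

# Two observations, one-hot encoded over two soil types
soil_dummies = pd.DataFrame({
    "Soil_Type1": [1, 0],
    "Soil_Type2": [0, 1],
})
# idxmax(axis=1) returns the column name holding the 1 in each row
soil_cat = soil_dummies.idxmax(axis=1)
print(list(soil_cat))  # ['Soil_Type1', 'Soil_Type2']
```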
import prince
import pprint
from mpl_toolkits.mplot3d import Axes3D
import plotly.offline as pyo
pyo.init_notebook_mode()
# Instantiate FAMD object
famd = prince.FAMD(
    n_components=3,
    n_iter=20,
    copy=True,
    check_input=True,
    engine='auto',
    random_state=42)
# Fit FAMD object to data
famd = famd.fit(treetype_FAMD)
projected=np.array(famd.row_coordinates(treetype_FAMD))
print(famd.explained_inertia_)
# Plot 3D scatter
import plotly.express as px
fig = px.scatter_3d(projected,x=0,y=1,z=2,color=treetype_target,opacity=0.7,color_discrete_sequence=px.colors.qualitative.D3)
fig.update_traces(marker=dict(size=2))
fig.show()
[0.06930829 0.04863929 0.03894013]
We will apply several classification techniques to this dataset. Plain binary logistic regression does not apply directly since the task involves 7 different classes, though a one-vs-rest extension handles it. K-nearest neighbors will be our first try.
Our first step would be split training/validation and test dataset.
# train test split
from sklearn.model_selection import train_test_split
X=treetype.iloc[:,np.r_[1:55]].copy()
y=treetype.iloc[:,-1].copy()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
# recode target value since multiple packages require first class start as 0, then 1,2,3,...
y_train=y_train-1
y_test=y_test-1
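Since the Kaggle training set is perfectly balanced across the seven classes, a plain random split keeps the class ratios roughly equal; passing `stratify=y` to `train_test_split` would make them exactly equal. A sketch on synthetic labels (an optional tweak, not what was run above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(70).reshape(-1, 1)
y_demo = np.repeat(np.arange(7), 10)      # 7 balanced classes, like this dataset
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)
# Every class contributes exactly 2 of its 10 samples to the test split
print(np.bincount(y_te))  # [2 2 2 2 2 2 2]
```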
Our task is multi-class classification. I included logistic regression, linear discriminant analysis, K-nearest neighbors, a decision tree, Gaussian naive Bayes, and a support vector machine.
# import all models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# create model list and add models
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto',max_iter=500))) # cap SVM iterations to keep runtime manageable
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy', n_jobs=-1)
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()
K-nearest neighbors has the best performance in the algorithm comparison, so our next step is to improve its performance by tuning the hyperparameters. I will also add XGBoost and a multilayer artificial neural network.
KNN
K-nearest neighbors is a simple classification technique. A few hyperparameters need to be tuned, including the number of neighbors, leaf size, the power parameter of the Minkowski metric, and the weighting scheme.
from sklearn.model_selection import GridSearchCV
from sklearn.utils import parallel_backend
from sklearn.neighbors import KNeighborsClassifier
#using grid search to get the best estimator
clf1 = KNeighborsClassifier()
#create parameter grid to pass to grid search
# note: each tuple enumerates the exact candidate values tried, not a range
param_dist = {
    'n_neighbors': (1, 40, 1),
    'leaf_size': (1, 40, 1),
    'p': (1, 2),
    'weights': ('uniform', 'distance'),
    'metric': ('minkowski', 'chebyshev'),
}
grid = GridSearchCV(clf1,param_dist,cv = 3,scoring = 'neg_log_loss',n_jobs=1)
with parallel_backend('threading'):
grid.fit(X_train,y_train)
best_estimator = grid.best_estimator_
print(best_estimator)
KNeighborsClassifier(leaf_size=1, n_neighbors=40, p=1, weights='distance')
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
KNN=grid.fit(X_train, y_train)
y_pred_KNN1 =KNN.predict(X_test)
print('BEST K-NEAREST NEIGHBORS MODEL')
print('Accuracy Score - KNN:', metrics.accuracy_score(y_test, y_pred_KNN1))
BEST K-NEAREST NEIGHBORS MODEL Accuracy Score - KNN: 0.8194444444444444
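One caveat: KNN is distance-based, so features on large scales (Elevation, the distance columns) dominate the metric. Wrapping the classifier in a scaling `Pipeline` is a refinement worth trying; a sketch on synthetic data, not what was run above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
# Scaling is fit inside each CV fold, so there is no leakage across folds
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(knn_scaled, X_demo, y_demo, cv=3)
print(scores.mean())
```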
The overall accuracy is good at about 0.82. The accuracy on lodgepole pine and spruce/fir could still be improved.
metrics.plot_confusion_matrix(grid,X_test, y_test, display_labels=["spruce/fir", "lodgepole pine", "ponderosa pine", "cottonwood/willow", "aspen", "Douglas-fir", "krummholz"])
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x29a0f657508>
I also plotted a multiclass ROC curve using the yellowbrick package; the results look strong for all forest cover types.
from yellowbrick.classifier import ROCAUC
visualizer = ROCAUC(grid, classes=["spruce/fir", "lodgepole pine", "ponderosa pine", "cottonwood/willow", "aspen", "Douglas-fir", "krummholz"])
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and show the figure
<AxesSubplot:title={'center':'ROC Curves for GridSearchCV'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
XGBoost
XGBoost is a tree-based method that performs well on many high-dimensional classification tasks.
import xgboost as xgb
xg_reg = xgb.XGBClassifier(colsample_bytree = 0.5, learning_rate = 0.1,
max_depth = 10, alpha = 5, n_estimators = 100,eval_metric='merror',use_label_encoder=False)
#label must be in [0,num_classs)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
np.mean(preds == y_test)
0.8465608465608465
#import packages
import optuna
from optuna import Trial, visualization
from optuna.samplers import TPESampler
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
# define the objective for optuna to tune
def objective(trial: Trial, X_train, y_train, X_test, y_test):
    param = {
        "n_estimators": trial.suggest_int("n_estimators", 0, 1000),
        'max_depth': trial.suggest_int('max_depth', 2, 25),
        'reg_alpha': trial.suggest_int('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_int('reg_lambda', 0, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 0, 5),
        'gamma': trial.suggest_int('gamma', 0, 5),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.005, 0.5),
        'colsample_bytree': trial.suggest_discrete_uniform('colsample_bytree', 0.1, 1, 0.01),
        'nthread': -1
    }
    model = XGBClassifier(**param, eval_metric='merror', use_label_encoder=False)
    model.fit(X_train, y_train)
    # note: scoring with CV on the test split lets the test set influence
    # tuning; cross-validating on the training split would be cleaner
    return cross_val_score(model, X_test, y_test).mean()
# create optuna study and optimize it
study = optuna.create_study(direction='maximize',sampler=TPESampler())
study.optimize(lambda trial : objective(trial,X_train,y_train,X_test,y_test),n_trials= 50)
[I 2021-04-26 00:58:42,582] A new study created in memory with name: no-name-556336f8-4e2b-47b6-b080-05f9b0c43e6c [I 2021-04-26 00:59:29,610] Trial 0 finished with value: 0.7533085217010564 and parameters: {'n_estimators': 918, 'max_depth': 7, 'reg_alpha': 5, 'reg_lambda': 5, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.012674196295209331, 'colsample_bytree': 0.24000000000000002}. Best is trial 0 with value: 0.7533085217010564. [I 2021-04-26 00:59:49,046] Trial 1 finished with value: 0.7728170324558044 and parameters: {'n_estimators': 175, 'max_depth': 24, 'reg_alpha': 0, 'reg_lambda': 5, 'min_child_weight': 3, 'gamma': 0, 'learning_rate': 0.07782964743844427, 'colsample_bytree': 0.30000000000000004}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:00:28,509] Trial 2 finished with value: 0.7711646872092386 and parameters: {'n_estimators': 344, 'max_depth': 13, 'reg_alpha': 1, 'reg_lambda': 0, 'min_child_weight': 3, 'gamma': 2, 'learning_rate': 0.0073954669405211495, 'colsample_bytree': 0.48}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:00:37,875] Trial 3 finished with value: 0.72884078594494 and parameters: {'n_estimators': 386, 'max_depth': 2, 'reg_alpha': 3, 'reg_lambda': 3, 'min_child_weight': 2, 'gamma': 1, 'learning_rate': 0.22374543969240085, 'colsample_bytree': 0.54}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:02:20,132] Trial 4 finished with value: 0.7390898144600733 and parameters: {'n_estimators': 678, 'max_depth': 22, 'reg_alpha': 0, 'reg_lambda': 5, 'min_child_weight': 3, 'gamma': 5, 'learning_rate': 0.012644195914056654, 'colsample_bytree': 0.67}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:03:21,432] Trial 5 finished with value: 0.7275195665261891 and parameters: {'n_estimators': 567, 'max_depth': 22, 'reg_alpha': 5, 'reg_lambda': 0, 'min_child_weight': 2, 'gamma': 5, 'learning_rate': 0.25677057694080363, 'colsample_bytree': 0.49}. 
Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:04:45,917] Trial 6 finished with value: 0.7519851130206338 and parameters: {'n_estimators': 891, 'max_depth': 15, 'reg_alpha': 0, 'reg_lambda': 3, 'min_child_weight': 5, 'gamma': 5, 'learning_rate': 0.10984835204411156, 'colsample_bytree': 0.27}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:05:15,132] Trial 7 finished with value: 0.7675255869957857 and parameters: {'n_estimators': 331, 'max_depth': 16, 'reg_alpha': 4, 'reg_lambda': 1, 'min_child_weight': 4, 'gamma': 1, 'learning_rate': 0.33969740923671515, 'colsample_bytree': 0.52}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:06:21,665] Trial 8 finished with value: 0.712633681790816 and parameters: {'n_estimators': 649, 'max_depth': 9, 'reg_alpha': 4, 'reg_lambda': 5, 'min_child_weight': 1, 'gamma': 5, 'learning_rate': 0.005783303335429581, 'colsample_bytree': 0.8}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:06:37,726] Trial 9 finished with value: 0.7470242460730119 and parameters: {'n_estimators': 710, 'max_depth': 2, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 4, 'gamma': 0, 'learning_rate': 0.04494297188637666, 'colsample_bytree': 0.41000000000000003}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:06:40,670] Trial 10 finished with value: 0.650461386897269 and parameters: {'n_estimators': 24, 'max_depth': 24, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 0, 'gamma': 3, 'learning_rate': 0.04040394738382723, 'colsample_bytree': 0.13}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:06:50,860] Trial 11 finished with value: 0.7519894915439768 and parameters: {'n_estimators': 107, 'max_depth': 11, 'reg_alpha': 1, 'reg_lambda': 2, 'min_child_weight': 3, 'gamma': 3, 'learning_rate': 0.113093679688585, 'colsample_bytree': 0.33}. Best is trial 1 with value: 0.7728170324558044. 
[I 2021-04-26 01:07:07,958] Trial 12 finished with value: 0.7215631328334519 and parameters: {'n_estimators': 215, 'max_depth': 18, 'reg_alpha': 1, 'reg_lambda': 1, 'min_child_weight': 4, 'gamma': 2, 'learning_rate': 0.02119371895583926, 'colsample_bytree': 0.14}. Best is trial 1 with value: 0.7728170324558044. [I 2021-04-26 01:07:39,728] Trial 13 finished with value: 0.7771123638552898 and parameters: {'n_estimators': 206, 'max_depth': 19, 'reg_alpha': 0, 'reg_lambda': 2, 'min_child_weight': 3, 'gamma': 2, 'learning_rate': 0.09115005742239161, 'colsample_bytree': 0.6799999999999999}. Best is trial 13 with value: 0.7771123638552898. [I 2021-04-26 01:08:02,373] Trial 14 finished with value: 0.7757927863827925 and parameters: {'n_estimators': 154, 'max_depth': 25, 'reg_alpha': 0, 'reg_lambda': 2, 'min_child_weight': 5, 'gamma': 1, 'learning_rate': 0.0957809577045866, 'colsample_bytree': 0.92}. Best is trial 13 with value: 0.7771123638552898. [I 2021-04-26 01:08:04,329] Trial 15 finished with value: 0.7420683049641508 and parameters: {'n_estimators': 13, 'max_depth': 19, 'reg_alpha': 2, 'reg_lambda': 2, 'min_child_weight': 5, 'gamma': 1, 'learning_rate': 0.15610765865911327, 'colsample_bytree': 1.0}. Best is trial 13 with value: 0.7771123638552898. [I 2021-04-26 01:08:44,232] Trial 16 finished with value: 0.773480926057687 and parameters: {'n_estimators': 254, 'max_depth': 20, 'reg_alpha': 0, 'reg_lambda': 1, 'min_child_weight': 5, 'gamma': 2, 'learning_rate': 0.06087547683559619, 'colsample_bytree': 1.0}. Best is trial 13 with value: 0.7771123638552898. [I 2021-04-26 01:09:52,067] Trial 17 finished with value: 0.7324755076350501 and parameters: {'n_estimators': 473, 'max_depth': 25, 'reg_alpha': 2, 'reg_lambda': 2, 'min_child_weight': 4, 'gamma': 4, 'learning_rate': 0.45049155894793963, 'colsample_bytree': 0.86}. Best is trial 13 with value: 0.7771123638552898. 
[I 2021-04-26 01:10:14,184] Trial 18 finished with value: 0.779758633900717 and parameters: {'n_estimators': 124, 'max_depth': 21, 'reg_alpha': 0, 'reg_lambda': 3, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.030332552916299056, 'colsample_bytree': 0.69}. Best is trial 18 with value: 0.779758633900717. [I 2021-04-26 01:10:27,236] Trial 19 finished with value: 0.7447134803787423 and parameters: {'n_estimators': 73, 'max_depth': 17, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 0, 'gamma': 3, 'learning_rate': 0.023236508816652644, 'colsample_bytree': 0.6799999999999999}. Best is trial 18 with value: 0.779758633900717. [I 2021-04-26 01:11:06,457] Trial 20 finished with value: 0.7539707733566854 and parameters: {'n_estimators': 264, 'max_depth': 21, 'reg_alpha': 3, 'reg_lambda': 3, 'min_child_weight': 1, 'gamma': 2, 'learning_rate': 0.030539674960272926, 'colsample_bytree': 0.7}. Best is trial 18 with value: 0.779758633900717. [I 2021-04-26 01:11:30,330] Trial 21 finished with value: 0.7837244814186415 and parameters: {'n_estimators': 138, 'max_depth': 25, 'reg_alpha': 0, 'reg_lambda': 2, 'min_child_weight': 2, 'gamma': 1, 'learning_rate': 0.0770768808537012, 'colsample_bytree': 0.87}. Best is trial 21 with value: 0.7837244814186415. [I 2021-04-26 01:11:47,256] Trial 22 finished with value: 0.7830671556017734 and parameters: {'n_estimators': 101, 'max_depth': 22, 'reg_alpha': 0, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 1, 'learning_rate': 0.06338027838758882, 'colsample_bytree': 0.77}. Best is trial 21 with value: 0.7837244814186415. [I 2021-04-26 01:11:51,882] Trial 23 finished with value: 0.7675305128345465 and parameters: {'n_estimators': 28, 'max_depth': 23, 'reg_alpha': 0, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 1, 'learning_rate': 0.057279194424913006, 'colsample_bytree': 0.78}. Best is trial 21 with value: 0.7837244814186415. 
[I 2021-04-26 01:12:15,204] Trial 24 finished with value: 0.7807503694379071 and parameters: {'n_estimators': 127, 'max_depth': 25, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.034955112472050645, 'colsample_bytree': 0.9}. Best is trial 21 with value: 0.7837244814186415. [I 2021-04-26 01:12:15,919] Trial 25 finished with value: 0.7447156696404138 and parameters: {'n_estimators': 4, 'max_depth': 25, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 2, 'gamma': 0, 'learning_rate': 0.17172304288424253, 'colsample_bytree': 0.92}. Best is trial 21 with value: 0.7837244814186415. [I 2021-04-26 01:13:33,558] Trial 26 finished with value: 0.7804203382409283 and parameters: {'n_estimators': 436, 'max_depth': 23, 'reg_alpha': 1, 'reg_lambda': 4, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.016935107468480263, 'colsample_bytree': 0.91}. Best is trial 21 with value: 0.7837244814186415. [I 2021-04-26 01:14:21,697] Trial 27 finished with value: 0.7860396256362542 and parameters: {'n_estimators': 301, 'max_depth': 25, 'reg_alpha': 2, 'reg_lambda': 3, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.06421846317471273, 'colsample_bytree': 0.78}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:14:40,014] Trial 28 finished with value: 0.7592583876087788 and parameters: {'n_estimators': 297, 'max_depth': 5, 'reg_alpha': 3, 'reg_lambda': 3, 'min_child_weight': 0, 'gamma': 1, 'learning_rate': 0.06935040907538659, 'colsample_bytree': 0.77}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:15:07,951] Trial 29 finished with value: 0.7830682502326091 and parameters: {'n_estimators': 420, 'max_depth': 23, 'reg_alpha': 2, 'reg_lambda': 3, 'min_child_weight': 2, 'gamma': 0, 'learning_rate': 0.131808424452916, 'colsample_bytree': 0.6}. Best is trial 27 with value: 0.7860396256362542. 
[I 2021-04-26 01:15:32,855] Trial 30 finished with value: 0.7837277653111488 and parameters: {'n_estimators': 545, 'max_depth': 20, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.2036033316178124, 'colsample_bytree': 0.59}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:16:03,926] Trial 31 finished with value: 0.7837277653111488 and parameters: {'n_estimators': 561, 'max_depth': 23, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.14252441911009103, 'colsample_bytree': 0.62}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:16:28,472] Trial 32 finished with value: 0.7843916589130316 and parameters: {'n_estimators': 549, 'max_depth': 20, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.1872289931369457, 'colsample_bytree': 0.6}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:16:49,128] Trial 33 finished with value: 0.7820743254337476 and parameters: {'n_estimators': 557, 'max_depth': 20, 'reg_alpha': 2, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.24319598020770508, 'colsample_bytree': 0.61}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:17:06,963] Trial 34 finished with value: 0.7734776421651798 and parameters: {'n_estimators': 766, 'max_depth': 13, 'reg_alpha': 3, 'reg_lambda': 1, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.3808095991760999, 'colsample_bytree': 0.42000000000000004}. Best is trial 27 with value: 0.7860396256362542. [I 2021-04-26 01:17:28,513] Trial 35 finished with value: 0.7850489847298998 and parameters: {'n_estimators': 562, 'max_depth': 15, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.18951929722445632, 'colsample_bytree': 0.59}. Best is trial 27 with value: 0.7860396256362542. 
[I 2021-04-26 01:17:48,643] Trial 36 finished with value: 0.7764544907230035 and parameters: {'n_estimators': 646, 'max_depth': 14, 'reg_alpha': 3, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.2910798518595401, 'colsample_bytree': 0.44000000000000006}. Best is trial 27 with value: 0.7860396256362542.
[I 2021-04-26 01:18:13,777] Trial 37 finished with value: 0.785711236385529 and parameters: {'n_estimators': 807, 'max_depth': 11, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.16688503013487335, 'colsample_bytree': 0.5700000000000001}. Best is trial 27 with value: 0.7860396256362542.
[I 2021-04-26 01:18:31,528] Trial 38 finished with value: 0.7771145531169614 and parameters: {'n_estimators': 867, 'max_depth': 11, 'reg_alpha': 3, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 0, 'learning_rate': 0.4855703182059533, 'colsample_bytree': 0.55}. Best is trial 27 with value: 0.7860396256362542.
[I 2021-04-26 01:19:44,031] Trial 39 finished with value: 0.7645476438071261 and parameters: {'n_estimators': 996, 'max_depth': 7, 'reg_alpha': 4, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.19067353179320293, 'colsample_bytree': 0.47}. Best is trial 27 with value: 0.7860396256362542.
[I 2021-04-26 01:20:02,610] Trial 40 finished with value: 0.7866958568222866 and parameters: {'n_estimators': 806, 'max_depth': 10, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.3337983388597914, 'colsample_bytree': 0.55}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:20:20,397] Trial 41 finished with value: 0.7840545126156203 and parameters: {'n_estimators': 812, 'max_depth': 11, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.34186894860277733, 'colsample_bytree': 0.56}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:20:36,316] Trial 42 finished with value: 0.7853801105577144 and parameters: {'n_estimators': 620, 'max_depth': 9, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.27025100057863477, 'colsample_bytree': 0.5}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:20:54,145] Trial 43 finished with value: 0.7741387991899731 and parameters: {'n_estimators': 728, 'max_depth': 9, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.298574472684073, 'colsample_bytree': 0.38}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:22:25,797] Trial 44 finished with value: 0.7744699250177879 and parameters: {'n_estimators': 979, 'max_depth': 9, 'reg_alpha': 3, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 1, 'learning_rate': 0.2351675743191006, 'colsample_bytree': 0.48}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:22:39,983] Trial 45 finished with value: 0.7827371244047946 and parameters: {'n_estimators': 624, 'max_depth': 6, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.40035610111976555, 'colsample_bytree': 0.52}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:23:08,322] Trial 46 finished with value: 0.7787674456789448 and parameters: {'n_estimators': 829, 'max_depth': 10, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.12356767782495835, 'colsample_bytree': 0.37}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:26:17,322] Trial 47 finished with value: 0.7754627551858135 and parameters: {'n_estimators': 925, 'max_depth': 13, 'reg_alpha': 1, 'reg_lambda': 0, 'min_child_weight': 0, 'gamma': 1, 'learning_rate': 0.2776757257842325, 'colsample_bytree': 0.74}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:26:46,183] Trial 48 finished with value: 0.7423950522686225 and parameters: {'n_estimators': 502, 'max_depth': 7, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 4, 'learning_rate': 0.0898444247604902, 'colsample_bytree': 0.22}. Best is trial 40 with value: 0.7866958568222866.
[I 2021-04-26 01:27:29,538] Trial 49 finished with value: 0.7724930217284222 and parameters: {'n_estimators': 352, 'max_depth': 12, 'reg_alpha': 3, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.01012430886391942, 'colsample_bytree': 0.65}. Best is trial 40 with value: 0.7866958568222866.
print('Best trial: score {},\nparams {}'.format(study.best_trial.value,study.best_trial.params))
Best trial: score 0.7866958568222866,
params {'n_estimators': 806, 'max_depth': 10, 'reg_alpha': 2, 'reg_lambda': 0, 'min_child_weight': 1, 'gamma': 0, 'learning_rate': 0.3337983388597914, 'colsample_bytree': 0.55}
Below are an optimization history plot of all trials (showing the running best score) and a slice plot of how each hyperparameter relates to the objective value.
optuna.visualization.plot_optimization_history(study)
optuna.visualization.plot_slice(study)
Refitting XGBoost with the best trial's parameters gives a test accuracy of about 87%.
model = XGBClassifier(**study.best_trial.params, eval_metric='merror', use_label_encoder=False)
model.fit(X_train, y_train)
preds = model.predict(X_test)
np.mean(preds == y_test)
0.8707010582010583
XGBoost is clearly better than KNN at classifying this dataset; the confusion matrix shows fewer misclassifications between Lodgepole Pine and Spruce/Fir.
metrics.plot_confusion_matrix(model,X_test, y_test, display_labels=["lodgepole pine", "spruce/fir", "ponderosa pine", "Douglas-fir", "aspen", "cottonwood/willow", "krummholz"])
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x29a7b6b1f48>
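The per-class comparison above can be read directly off the confusion matrix: each row is a true class, so dividing the diagonal by the row sums gives per-class recall. A small sketch on toy 3-class data (hypothetical labels, not the forest cover data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for a 3-class problem; in the confusion matrix,
# rows are true classes and columns are predicted classes.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)  # recall for classes 0, 1, 2: 2/3, 1.0, 2/3
```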
from yellowbrick.classifier import ROCAUC
visualizer = ROCAUC(model, classes=["lodgepole pine", "spruce/fir", "ponderosa pine", "Douglas-fir", "aspen", "cottonwood/willow", "krummholz"])
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and show the figure
<AxesSubplot:title={'center':'ROC Curves for XGBClassifier'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
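yellowbrick's ROCAUC visualizer plots one-vs-rest ROC curves per class; the same macro-averaged AUC can be computed directly with scikit-learn's `roc_auc_score`. A toy sketch with hypothetical class probabilities (`toy_proba` is made up here; each row sums to 1, like `predict_proba` output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical predicted probabilities for a 3-class problem
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_proba = np.array([[0.8, 0.1, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.3, 0.5, 0.2],
                      [0.1, 0.2, 0.7],
                      [0.2, 0.2, 0.6]])

# One-vs-rest, macro-averaged AUC across the three classes
macro_auc = roc_auc_score(toy_true, toy_proba, multi_class='ovr', average='macro')
print(macro_auc)  # 1.0: every class is perfectly separated in this toy example
```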
Neural Network
Artificial neural networks usually underperform tree-based methods on structured (tabular) data, but we will give them a try here.
import tensorflow as tf
from tensorflow import keras
import kerastuner as kt
#Check GPU usage
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available: 1
Here we use Keras Tuner to optimize the number of layers, the number of neurons in each layer, and the learning rate.
def model_builder(hp):
    model = keras.Sequential()
    # Tune the number of hidden layers (1-15) and the number of units
    # in each Dense layer (16-512, in steps of 16)
    for i in range(hp.Int('num_layers', 1, 15)):
        model.add(keras.layers.Dense(units=hp.Int('units_' + str(i),
                                                  min_value=16,
                                                  max_value=512,
                                                  step=16),
                                     activation='relu'))
    model.add(keras.layers.Dense(7, activation='softmax'))
    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    # The output layer already applies softmax, so the loss is computed
    # on probabilities (from_logits=False), not raw logits
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    return model
tuner = kt.Hyperband(model_builder,
                     max_epochs=30,
                     factor=3,
                     objective='val_accuracy',
                     directory='forest_cover5',
                     project_name='kt')

tuner.search(X_train, y_train,
             epochs=50,
             validation_split=0.2)
Trial 90 Complete [00h 00m 40s]
val_accuracy: 0.14669421315193176
Best val_accuracy So Far: 0.7376033067703247
Total elapsed time: 00h 14m 23s
INFO:tensorflow:Oracle triggered exit
# Retrieve the best model.
best_model = tuner.get_best_models(num_models=1)[0]
# Evaluate the best model.
loss, accuracy = best_model.evaluate(X_test, y_test)
95/95 [==============================] - 1s 3ms/step - loss: 0.5878 - accuracy: 0.7504
The performance is not quite as good as XGBoost's, but it is on par with K-nearest neighbors. However, since K-nearest neighbors is a much simpler method computationally, it is clearly a better choice than the ANN here.